Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding
Abstract
Visual context provides grounding information for multimodal machine translation (MMT). However, previous MMT models and probing studies on visual features suggest that visual information is less explored in MMT, as it is often redundant to textual information. In this paper, we propose an Object-level Visual Context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation. With detected objects, the proposed OVC encourages MMT to ground translation on desirable visual objects by masking irrelevant objects in the visual modality. We equip the proposed OVC with an additional object-masking loss to achieve this goal. The object-masking loss is estimated according to the similarity between masked objects and the source texts, so as to encourage masking source-irrelevant objects. Additionally, in order to generate vision-consistent target words, we further propose a vision-weighted translation loss for OVC. Experiments on MMT datasets demonstrate that the proposed OVC model outperforms state-of-the-art MMT models, and analyses show that masking irrelevant objects helps grounding in MMT.
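The abstract says only that the object-masking loss is estimated from the similarity between masked objects and the source text. The following is a minimal sketch of that idea, not the paper's formulation: the cosine similarity measure, the threshold rule in `select_mask`, the averaging in `object_masking_loss`, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_mask(object_vecs, text_vec, threshold=0.5):
    """Hypothetical rule: mask objects whose similarity to the text is below the threshold."""
    return [cosine(o, text_vec) < threshold for o in object_vecs]


def object_masking_loss(object_vecs, text_vec, mask):
    """Mean clamped similarity between masked objects and the source text.

    Assumed form: the value is low when only source-irrelevant objects
    are masked and high when a source-relevant object is masked.
    """
    sims = [max(0.0, cosine(o, text_vec)) for o, m in zip(object_vecs, mask) if m]
    return sum(sims) / len(sims) if sims else 0.0


# Toy data (made up): one text vector and three detected-object vectors.
text = [1.0, 0.0]
objects = [[0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

mask = select_mask(objects, text)                   # masks the two dissimilar objects
loss_good = object_masking_loss(objects, text, mask)
loss_bad = object_masking_loss(objects, text, [True, False, False])
```

In this toy setting, masking the two objects that are dissimilar to the text gives a smaller loss than masking the one object that matches it, which is the behavior the abstract describes as encouraging the masking of source-irrelevant objects.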
Similar resources
SHEF-Multimodal: Grounding Machine Translation on Images
This paper describes the University of Sheffield’s submission for the WMT16 Multimodal Machine Translation shared task, where we participated in Task 1 to develop German-to-English and English-to-German statistical machine translation (SMT) systems in the domain of image descriptions. Our proposed systems are standard phrase-based SMT systems based on the Moses decoder, trained only on the provi...
Sheffield MultiMT: Using Object Posterior Predictions for Multimodal Machine Translation
This paper describes the University of Sheffield’s submission to the WMT17 Multimodal Machine Translation shared task. We participated in Task 1 to develop an MT system to translate an image description from English to German and French, given its corresponding image. Our proposed systems are based on the state-of-the-art Neural Machine Translation approach. We investigate the effect of replaci...
An Efficient Character-Level Neural Machine Translation
Neural machine translation aims at building a single large neural network that can be trained to maximize translation performance. The encoder-decoder architecture with an attention mechanism achieves a translation performance comparable to the existing state-of-the-art phrase-based systems on the task of English-to-French translation. However, the use of large vocabulary becomes the bottleneck...
Object-Level Context Modeling For Scene Classification with Context-CNN
Convolutional Neural Networks (CNNs) have been used extensively for computer vision tasks and produce rich feature representation for objects or parts of an image. But reasoning about scenes requires integration between the low-level feature representations and the high-level semantic information. We propose a deep network architecture which models the semantic context of scenes by capturing ob...
Exploiting Document-Level Context for Data-Driven Machine Translation
This paper presents a method for exploiting document-level similarity between the documents in the training corpus for a corpus-driven (statistical or example-based) machine translation system and the input documents it must translate. The method is simple to implement, efficient (increases the translation time of an example-based system by only a few percent), and robust (still works even when ...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2021
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v35i4.16376